feat(PLU-161): flatten_metadata option for databricks_volume_delta_tables by awalker4 · Pull Request #712 · Unstructured-IO/unstructured-ingest

awalker4 · 2026-05-20T20:24:47Z

Summary

Adds opt-in flatten_metadata: bool (default False) to DatabricksVolumeDeltaTableUploaderConfig and DatabricksVolumeDeltaTableStagerConfig.
When set, the stager recursively flattens nested metadata into top-level columns (no metadata_ prefix) using the existing flatten_dict helper (utils/data_prep.py) — stops at lists; matches Milvus's flat naming.
When set, the uploader skips auto-create and intersects incoming columns with the user-managed table schema, dropping unknowns with a log line (mirrors the existing _fit_to_schema message in sql/sql.py).
Switches the destination-registry's upload_stager_config from the empty SQL pass-class to the now-non-empty Volumes-specific stager config so the new field surfaces in the dashboard schema. Safe — verified no external imports of the swapped class and no field loss.
Backward compatibility is preserved: with the flag absent or False, behavior is byte-identical to today (single metadata VARIANT blob, default auto-create). Workflow DB-serialized configs without the field deserialize cleanly with flatten_metadata=False.
No type coercion on the flatten path — values flow into Delta as their JSON-native types. Customers designing the user-managed table should declare datetime fields (date_created, date_modified, date_processed, and the data_source_* variants) as STRING; they arrive as stringified or float unix epochs.

Why

Linear: PLU-161. Customers want to query Delta tables via normal SQL columns (`SELECT filename, data_source_url ...`) rather than dotting into a `VARIANT` blob. Per Potter, this is also intended as the reusable pattern for other table-like connectors.

Example job run

All metadata fields are flattened. You can see languages comes in as a list,

Test plan

New unit tests: stager blob-mode default, flatten drops the metadata_ prefix, flatten stops at lists, both configs deserialize with default False for Workflow DB backcompat, uploader missing-table hard-fail, uploader PARSE_JSON wrapping on the non-flatten path, uploader unknown-column drop + log.
Existing test_databricks_delta_tables.py (SQL connector) and the destination-config JSON-schema test still pass — confirms the registry switch doesn't ripple.
SND manual matrix: cases include backcompat, default schema auto-create, flatten with custom table, missing-table hard-fail, nullable columns, nested keys, list-valued columns. Pending plugin-image build off this branch.

🤖 Generated with Claude Code

…bles Adds an opt-in `flatten_metadata` flag to the Volumes Delta Tables uploader. When set, the stager recursively flattens the metadata dict into top-level `metadata_*` columns (stops at lists, preserves the `metadata_` prefix per the PRD) using the existing `flatten_dict` helper. The uploader skips auto-create and intersects incoming columns with the user-managed table schema, dropping unknowns with a log line. Default is False so existing Workflow DB-serialized configs continue to behave identically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Stringified unix epochs (e.g. metadata.data_source.date_processed = "1779329564.5102773") are rejected by Databricks's implicit string -> TIMESTAMP cast with CAST_INVALID_INPUT. Run flattened datetime fields through the SQL connector's existing parse_date_string and emit ISO format so TIMESTAMP columns coerce natively. Malformed values pass through unchanged so the table can reject them, matching the non-flatten path's semantics. The known datetime field set is imported from sql.sql._DATE_COLUMNS so this stays in sync with the rest of ingest as new datetime fields land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

awalker4 · 2026-05-21T18:29:00Z

Finally address unrelated flaky test: #714

Let metadata fields flow into Delta as their JSON-native types. Customers designing the user-managed table should declare datetime fields as STRING. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 3 files (changes from recent commits).

_{Shadow auto-approve: would not auto-approve because issues were found.
Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic}

… modes After dropping coercion in 0c31cb2 the existing tests no longer exercised how datetime metadata flows through to Delta. Pin the new contract: string and float epochs pass through unchanged in flatten mode, and the blob mode keeps them byte-identical inside the JSON column. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai

0 issues found across 1 file (changes from recent commits).

_{Shadow auto-approve: would require human review. This PR adds opt-in metadata flattening to Databricks volume delta tables, modifying core stager and uploader logic with conditional branching, column filtering, and auto-create bypass—changes that carry risk to data integrity and require human review to ensure correctness before merging.

Re-trigger cubic}

…issing Adds a flatten-gated SHOW TABLES check to the Volumes Delta Tables precheck so a misconfigured workflow fails at save time instead of per-document at run time. create_destination under flatten mode becomes an early-return no-op so precheck is the single source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai

0 issues found across 2 files (changes from recent commits).

_{Shadow auto-approve: would require human review. This PR adds an opt-in flatten_metadata feature that modifies stager and uploader logic, including metadata handling, table auto-create skip, and column intersection—changes that affect core ingestion behavior and could have subtle edge cases despite being default-off, so human review is warranted.

Re-trigger cubic}

potter-potter · 2026-05-22T17:18:31Z

@claude review

claude · 2026-05-22T17:18:50Z

Claude finished @potter-potter's task —— View job

Code Review for PLU-161: flatten_metadata option

Todo List:

Read repository guidelines (CLAUDE.md) - None found
Review main implementation in volumes_table.py
Examine test coverage and quality
Check configuration changes
Review changelog and version updates
Provide comprehensive review feedback

✅ Overall Assessment

This is a well-implemented feature that adds opt-in metadata flattening to Databricks Volume Delta Tables. The implementation is thoughtful, maintains backward compatibility, and includes comprehensive test coverage.

🔍 Detailed Review

Implementation Quality: Excellent

Strengths:

Clean Design: The flatten_metadata flag is properly implemented in both stager and uploader configs with sensible defaults (False for backward compatibility)
Proper Separation: Stager handles flattening logic, uploader handles schema intersection - good separation of concerns
Error Handling: Precheck validation ensures table exists when flatten mode is enabled, preventing runtime failures
Logging: Appropriate info-level logging when dropping unknown columns

Key Implementation Details:

Stager flattens metadata using existing flatten_dict helper with flatten_lists=False - volumes_table.py:88-95
Uploader intersects columns with destination schema and logs drops - volumes_table.py:219-227
Auto-create is disabled in flatten mode as documented - volumes_table.py:117-119
Backward compatibility preserved through False default - volumes_table.py:44-50

Test Coverage: Comprehensive

The test suite covers all critical paths:

✅ Default blob mode behavior (test_stager_blob_mode_is_default)
✅ Metadata flattening without prefixes (test_stager_flatten_drops_metadata_prefix)
✅ List preservation (test_stager_flatten_stops_at_lists)
✅ Datetime passthrough contracts for both string and float epochs
✅ Configuration backward compatibility (test_flatten_metadata_defaults_false_for_workflow_db_backcompat)
✅ Uploader behaviors: precheck validation, column intersection, SQL generation
✅ Error cases: missing tables, unknown columns

The tests properly verify the datetime handling contract mentioned in the PR description.

Security & Robustness: Good

No SQL injection risks - uses parameterized queries and proper quoting via quote_identifier
Fail-fast precheck prevents silent data loss in flatten mode
Column intersection prevents schema drift issues
No secrets or sensitive data handling concerns

Documentation & Usability: Clear

Field descriptions clearly explain the feature and requirements
Changelog entry is accurate and informative
Error messages are descriptive (e.g., "auto-create is disabled to prevent silent data loss")

🔧 Technical Notes

Schema Intersection Logic: The uploader correctly filters incoming columns against destination schema (volumes_table.py:220-221), which is essential for user-managed tables.
PARSE_JSON Handling: The implementation correctly wraps metadata with PARSE_JSON() only in non-flatten mode (volumes_table.py:231), maintaining the existing VARIANT behavior.
Datetime Contract: No type coercion occurs in flatten mode - values flow through as their JSON-native types (strings or floats), which is the intended behavior per the PR description.

⚠️ Minor Observations

Registry Switch: The destination registry now uses the Volumes-specific stager config instead of the empty SQL pass-class. This is safe as verified by the author, but worth noting for future maintenance.
Error Context: The precheck error message could include the actual catalog/database context, but the current message is sufficiently clear.

✅ Approval Recommendation

This PR is ready for merge. The implementation is solid, well-tested, and follows established patterns in the codebase. The feature addresses a clear customer need while maintaining full backward compatibility.

The addressing of previous review comments (float type handling, test coverage improvements) demonstrates attention to feedback and code quality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai

0 issues found across 2 files (changes from recent commits).

_{Shadow auto-approve: would require human review. Adds a new flatten_metadata feature that modifies core stager and uploader logic, including conditional auto-create skipping, schema intersection, and precheck queries, which introduces non-trivial business logic risk and requires human review for correctness and safety.

Re-trigger cubic}

potter-potter · 2026-05-22T18:59:01Z

@claude please give one more review. we dropped the datetime and timestamp conversions because we dont want to do that anymore

claude · 2026-05-22T18:59:17Z

Claude finished @potter-potter's task —— View job

Code Review for PLU-161 (Updated Review)

Todo List:

Read current implementation files to understand datetime changes
Review test coverage for datetime handling contract
Check for any remaining datetime conversion logic
Provide updated review feedback based on dropped conversions

✅ Updated Review - Datetime/Timestamp Conversion Removal

After reviewing the current implementation, I can confirm that the datetime and timestamp conversions have been properly removed. Here's my updated assessment:

Key Changes Confirmed:

No Datetime Coercion Logic: The implementation in volumes_table.py has no datetime conversion code. Values flow through unchanged as their JSON-native types (strings or floats).
Test Coverage Updated: The test suite now properly covers the datetime pass-through contract:
- test_stager_flatten_passes_datetime_strings_through_unchanged() - verifies string epochs remain as strings
- test_stager_flatten_passes_float_epoch_through_unchanged() - verifies numeric epochs remain as floats
- test_stager_blob_mode_preserves_datetime_strings() - verifies non-flatten path is unchanged
Clean Implementation: The stager simply calls flatten_dict() with flatten_lists=False and no additional processing (volumes_table.py:88-95).

Implementation Quality: Excellent

Strengths of the Current Approach:

Simplicity: No complex datetime parsing or conversion logic to maintain
Performance: No overhead from type checking and conversion
Flexibility: Customers control datetime field types in their Delta table schema
Consistency: Both string and float epochs are preserved exactly as received

Documentation & Contract:

The PR description and field descriptions clearly state the new contract:

"No type coercion on the flatten path — values flow into Delta as their JSON-native types. Customers designing the user-managed table should declare datetime fields (date_created, date_modified, date_processed, and the data_source_* variants) as STRING"

Test Coverage: Comprehensive

The updated test suite properly validates:

✅ String epochs pass through unchanged: "1779329564.5102773" → "1779329564.5102773"
✅ Float epochs pass through unchanged: 1779329564.5102773 → 1779329564.5102773
✅ Type preservation: isinstance(row["data_source_date_processed"], float) assertion
✅ Backward compatibility: blob mode behavior is unchanged

✅ Final Recommendation

The datetime/timestamp conversion removal is properly implemented. The changes:

Eliminated complexity - No more datetime parsing/conversion logic to maintain
Improved performance - No type checking overhead in the hot path
Enhanced flexibility - Customers have full control over datetime field handling
Maintained backward compatibility - Non-flatten mode is unchanged
Clear contract - Well-documented expectation that customers declare datetime fields as STRING

This approach is cleaner and more maintainable than the previous datetime coercion approach. The implementation correctly fulfills the requirement to remove datetime/timestamp conversions while preserving all other functionality.

Status: ✅ Ready for merge

potter-potter

LG

awalker4 temporarily deployed to ci May 20, 2026 20:24 — with GitHub Actions Inactive

awalker4 temporarily deployed to ci May 20, 2026 20:25 — with GitHub Actions Inactive

awalker4 temporarily deployed to ci May 21, 2026 02:44 — with GitHub Actions Inactive

chore(PLU-161): changelog + bump to 1.7.0

db98886

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

awalker4 temporarily deployed to ci May 21, 2026 02:48 — with GitHub Actions Inactive

chore(PLU-161): drop point-in-time comments from volumes_table

69060f7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

awalker4 temporarily deployed to ci May 21, 2026 02:51 — with GitHub Actions Inactive

awalker4 temporarily deployed to ci May 21, 2026 18:24 — with GitHub Actions Inactive

awalker4 commented May 21, 2026

View reviewed changes

Comment thread unstructured_ingest/processes/connectors/databricks/volumes_table.py Outdated

refactor(PLU-161): drop datetime coercion from flatten path

0c31cb2

Let metadata fields flow into Delta as their JSON-native types. Customers designing the user-managed table should declare datetime fields as STRING. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

awalker4 temporarily deployed to ci May 22, 2026 13:39 — with GitHub Actions Inactive

cubic-dev-ai Bot reviewed May 22, 2026

View reviewed changes

Comment thread test/unit/connectors/databricks/test_volumes_table.py

awalker4 temporarily deployed to ci May 22, 2026 15:35 — with GitHub Actions Inactive

cubic-dev-ai Bot reviewed May 22, 2026

View reviewed changes

chore(PLU-161): tighten missing-table precheck message

d6952d6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai Bot reviewed May 22, 2026

View reviewed changes

potter-potter approved these changes May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(PLU-161): flatten_metadata option for databricks_volume_delta_tables#712

feat(PLU-161): flatten_metadata option for databricks_volume_delta_tables#712
awalker4 merged 13 commits into
mainfrom
austin/plu-161-flatten-metadata-volumes

awalker4 commented May 20, 2026 •

edited

Loading

Uh oh!

awalker4 commented May 21, 2026

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

potter-potter commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

potter-potter commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026 •

edited

Loading

Uh oh!

potter-potter left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

awalker4 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Example job run

Test plan

Uh oh!

awalker4 commented May 21, 2026

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

potter-potter commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review for PLU-161: flatten_metadata option

✅ Overall Assessment

🔍 Detailed Review

Implementation Quality: Excellent

Test Coverage: Comprehensive

Security & Robustness: Good

Documentation & Usability: Clear

🔧 Technical Notes

⚠️ Minor Observations

✅ Approval Recommendation

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

potter-potter commented May 22, 2026

Uh oh!

claude Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review for PLU-161 (Updated Review)

✅ Updated Review - Datetime/Timestamp Conversion Removal

Key Changes Confirmed:

Implementation Quality: Excellent

Documentation & Contract:

Test Coverage: Comprehensive

✅ Final Recommendation

Uh oh!

potter-potter left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

awalker4 commented May 20, 2026 •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading

claude Bot commented May 22, 2026 •

edited

Loading

claude Bot commented May 22, 2026 •

edited

Loading